GPU acceleration with LiteRT Next

Graphics Processing Units (GPUs) are commonly used for deep learning acceleration due to their massive parallel throughput compared to CPUs. LiteRT Next simplifies the process of using GPU acceleration by allowing users to specify the hardware acceleration as a parameter when creating a Compiled Model (CompiledModel). LiteRT Next also uses a new and improved GPU acceleration implementation, not offered by LiteRT.

With LiteRT Next's GPU acceleration, you can create GPU-friendly input and output buffers, achieve zero-copy with your data in GPU memory, and execute tasks asynchronously to maximize parallelism.

For example implementations of LiteRT Next with GPU support, refer to the following demo applications:

Add GPU dependency

Use the following steps to add the GPU dependency to your Kotlin or C++ application.

Kotlin

For Kotlin users, the GPU accelerator is built-in and does not require additional steps beyond the Get Started guide.

C++

For C++ users, you must build the dependencies of the application with LiteRT GPU acceleration. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:

  • LiteRT C API shared library: The data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and GPU-specific components (@litert_gpu//:jni/arm64-v8a/libLiteRtGpuAccelerator.so).
  • Attribute dependencies: The deps attribute typically includes the GLES dependencies gles_deps(), and linkopts typically includes gles_linkopts(). Both matter for GPU acceleration because LiteRT often uses OpenGL ES on Android.
  • Model files and other assets: Included through the data attribute.

The following is an example of a cc_binary rule:

cc_binary(
    name = "your_application",
    srcs = [
        "main.cc",
    ],
    data = [
        ...
        # LiteRT C API shared library
        "//litert/c:litert_runtime_c_api_shared_lib",
        # GPU accelerator shared library
        "@litert_gpu//:jni/arm64-v8a/libLiteRtGpuAccelerator.so",
    ],
    linkopts = select({
        "@org_tensorflow//tensorflow:android": ["-landroid"],
        "//conditions:default": [],
    }) + gles_linkopts(), # gles link options
    deps = [
        ...
        "//litert/cc:litert_tensor_buffer", # litert cc library
        ...
    ] + gles_deps(), # gles dependencies
)

This setup allows your compiled binary to dynamically load and use the GPU for accelerated machine learning inference.

Get started

To get started using the GPU accelerator, pass the GPU parameter when creating the Compiled Model (CompiledModel). The following code snippet shows a basic implementation of the entire process:

C++

// 1. Load model
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));

// 2. Create a compiled model targeting GPU
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model, CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));

// 3. Prepare input/output buffers
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// 4. Fill input data (if you have CPU-based data)
input_buffers[0].Write<float>(absl::MakeConstSpan(cpu_data, data_size));

// 5. Execute
compiled_model.Run(input_buffers, output_buffers);

// 6. Access model output
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));

Kotlin

// Load model and initialize runtime
val model =
    CompiledModel.create(
        context.assets,
        "mymodel.tflite",
        CompiledModel.Options(Accelerator.GPU),
        env,
    )

// Preallocate input/output buffers
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// Fill the first input
inputBuffers[0].writeFloat(FloatArray(data_size) { data_value /* your data */ })

// Invoke
model.run(inputBuffers, outputBuffers)

// Read the output
val outputFloatArray = outputBuffers[0].readFloat()

For more information, see the Get Started with C++ or Get Started with Kotlin guides.

LiteRT Next GPU Accelerator

The new GPU Accelerator, available only with LiteRT Next, is optimized to handle AI workloads, like large matrix multiplications and KV cache for LLMs, more efficiently than previous versions. The LiteRT Next GPU Accelerator features the following key improvements over the LiteRT version:

  • Extended Operator Coverage: Handle larger, more complex neural networks.
  • Better Buffer Interoperability: Enable direct usage of GPU buffers for camera frames, 2D textures, or large LLM states.
  • Async Execution support: Overlap CPU pre-processing with GPU inference.

Zero-copy with GPU acceleration

Using zero-copy enables a GPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.
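
As a rough, CPU-only illustration of the difference between copying and zero-copy, the sketch below contrasts duplicating a buffer with wrapping it in a non-owning view. It is not LiteRT code: FloatView, CopyToNewBuffer, WrapWithoutCopy, and At are hypothetical names introduced for this example, and plain host memory stands in for GPU memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: a pointer and a length, no storage of its own.
struct FloatView {
  float* data;
  std::size_t size;
};

// Copying path: allocates a second buffer and duplicates every element.
std::vector<float> CopyToNewBuffer(const std::vector<float>& source) {
  return std::vector<float>(source.begin(), source.end());
}

// Zero-copy path: wraps the caller's storage. No allocation and no element
// copy take place, so the view is only valid while `source` is alive and
// is not resized.
FloatView WrapWithoutCopy(std::vector<float>& source) {
  return FloatView{source.data(), source.size()};
}

// Bounds-checked read through the view.
float At(const FloatView& view, std::size_t index) {
  assert(index < view.size);
  return view.data[index];
}
```

A write to the original buffer is visible through the view but not through the copy, which is the property that lets a wrapped GPU buffer skip the transfer step. The trade-off is lifetime: the wrapper does not own the memory, so the producer must keep it valid for as long as the consumer uses it.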

The following code is an example implementation of zero-copy GPU with OpenGL, an API for rendering 2D and 3D graphics. The code passes images in the OpenGL buffer format directly to LiteRT Next:

// Suppose you have an OpenGL buffer consisting of:
// target (GLenum), id (GLuint), size_bytes (size_t), and offset (size_t)
// Load model and compile for GPU
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));

// Create a TensorBuffer that wraps the OpenGL buffer.
LITERT_ASSIGN_OR_RETURN(auto tensor_type, model.GetInputTensorType("input_tensor_name"));
LITERT_ASSIGN_OR_RETURN(auto gl_input_buffer, TensorBuffer::CreateFromGlBuffer(env,
    tensor_type, opengl_buffer.target, opengl_buffer.id, opengl_buffer.size_bytes, opengl_buffer.offset));
std::vector<TensorBuffer> input_buffers{gl_input_buffer};
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Execute
compiled_model.Run(input_buffers, output_buffers);

// If your output is also GPU-backed, you can fetch an OpenCL buffer or re-wrap it as an OpenGL buffer:
LITERT_ASSIGN_OR_RETURN(auto out_cl_buffer, output_buffers[0].GetOpenClBuffer());

Asynchronous execution

LiteRT's asynchronous methods, like RunAsync(), let you schedule GPU inference while continuing other tasks using the CPU or the NPU. In complex pipelines, the GPU is often used asynchronously alongside the CPU or NPUs.

The following code snippet builds on the code provided in the Zero-copy with GPU acceleration example. It uses the CPU and GPU asynchronously and attaches a LiteRT Event to the input buffer. A LiteRT Event manages different types of synchronization primitives; the following code creates a managed LiteRT Event object of type LiteRtEventTypeEglSyncFence. This Event object ensures that the input buffer is not read until the GPU is done, and the synchronization happens without involving the CPU.

LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));

// 1. Prepare input buffer (OpenGL buffer)
LITERT_ASSIGN_OR_RETURN(auto gl_input,
    TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_buffer.target,
        opengl_buffer.id, opengl_buffer.size_bytes, opengl_buffer.offset));
std::vector<TensorBuffer> inputs{gl_input};
LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());

// 2. If the GL buffer is in use, create and set an event object to synchronize with the GPU.
LITERT_ASSIGN_OR_RETURN(auto input_event,
    Event::CreateManagedEvent(env, LiteRtEventTypeEglSyncFence));
inputs[0].SetEvent(std::move(input_event));

// 3. Kick off the GPU inference
compiled_model.RunAsync(inputs, outputs);

// 4. Meanwhile, do other CPU work...
// The CPU stays busy while the GPU runs.

// 5. Access model output
std::vector<float> data(output_data_size);
outputs[0].Read<float>(absl::MakeSpan(data));
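
The overlap pattern above can be sketched with the C++ standard library alone. This is an analogy under stated assumptions, not the LiteRT implementation: FakeInference and RunOverlapped are hypothetical names, a std::promise and std::shared_future pair stands in for the LiteRT Event, and a worker thread stands in for the GPU.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Stand-in for GPU inference: waits for the "input ready" event, then sums
// the input. A real pipeline would enqueue GPU work here instead.
float FakeInference(std::shared_future<void> input_ready,
                    const std::vector<float>& input) {
  input_ready.wait();  // Do not read the input before the producer is done.
  return std::accumulate(input.begin(), input.end(), 0.0f);
}

// Runs the stand-in inference on a worker thread while the calling thread
// finishes the input and does unrelated work. Returns {inference result,
// result of the unrelated CPU work}.
std::pair<float, int> RunOverlapped(std::vector<float>& input) {
  assert(!input.empty());
  std::promise<void> input_written;
  std::shared_future<void> input_ready = input_written.get_future().share();

  // 1. Kick off the inference. It blocks on the event, not on this thread.
  std::future<float> result =
      std::async(std::launch::async, [input_ready, &input] {
        return FakeInference(input_ready, input);
      });

  // 2. The producer finishes writing the input, then signals the event.
  input[0] = 10.0f;
  input_written.set_value();

  // 3. Meanwhile, do other CPU work.
  int cpu_work = 0;
  for (int i = 1; i <= 100; ++i) cpu_work += i;

  // 4. Block only when the output is actually needed.
  const float inference_result = result.get();
  return {inference_result, cpu_work};
}
```

The event orders the producer's write before the consumer's read, so the input is never observed half-written, and the caller blocks only at the point where it needs the output.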

Supported models

LiteRT Next supports GPU acceleration with the following models. Benchmark results are based on tests run on a Samsung Galaxy S24 device.

Model                                     LiteRT GPU acceleration  LiteRT GPU latency (ms)
hf_mms_300m                               Fully delegated          19.6
hf_mobilevit_small                        Fully delegated          8.7
hf_mobilevit_small_e2e                    Fully delegated          8.0
hf_wav2vec2_base_960h                     Fully delegated          9.1
hf_wav2vec2_base_960h_dynamic             Fully delegated          9.8
isnet                                     Fully delegated          43.1
timm_efficientnet                         Fully delegated          3.7
timm_nfnet                                Fully delegated          9.7
timm_regnety_120                          Fully delegated          12.1
torchaudio_deepspeech                     Fully delegated          4.6
torchaudio_wav2letter                     Fully delegated          4.8
torchvision_alexnet                       Fully delegated          3.3
torchvision_deeplabv3_mobilenet_v3_large  Fully delegated          5.7
torchvision_deeplabv3_resnet101           Fully delegated          35.1
torchvision_deeplabv3_resnet50            Fully delegated          24.5
torchvision_densenet121                   Fully delegated          13.9
torchvision_efficientnet_b0               Fully delegated          3.6
torchvision_efficientnet_b1               Fully delegated          4.7
torchvision_efficientnet_b2               Fully delegated          5.0
torchvision_efficientnet_b3               Fully delegated          6.1
torchvision_efficientnet_b4               Fully delegated          7.6
torchvision_efficientnet_b5               Fully delegated          8.6
torchvision_efficientnet_b6               Fully delegated          11.2
torchvision_efficientnet_b7               Fully delegated          14.7
torchvision_fcn_resnet50                  Fully delegated          19.9
torchvision_googlenet                     Fully delegated          3.9
torchvision_inception_v3                  Fully delegated          8.6
torchvision_lraspp_mobilenet_v3_large     Fully delegated          3.3
torchvision_mnasnet0_5                    Fully delegated          2.4
torchvision_mobilenet_v2                  Fully delegated          2.8
torchvision_mobilenet_v3_large            Fully delegated          2.8
torchvision_mobilenet_v3_small            Fully delegated          2.3
torchvision_resnet152                     Fully delegated          15.0
torchvision_resnet18                      Fully delegated          4.3
torchvision_resnet50                      Fully delegated          6.9
torchvision_squeezenet1_0                 Fully delegated          2.9
torchvision_squeezenet1_1                 Fully delegated          2.5
torchvision_vgg16                         Fully delegated          13.4
torchvision_wide_resnet101_2              Fully delegated          25.0
torchvision_wide_resnet50_2               Fully delegated          13.4
u2net_full                                Fully delegated          98.3
u2net_lite                                Fully delegated          51.4
hf_distil_whisper_small_no_cache          Partially delegated      251.9
hf_distilbert                             Partially delegated      13.7
hf_tinyroberta_squad2                     Partially delegated      17.1
hf_tinyroberta_squad2_dynamic_batch       Partially delegated      52.1
snapml_StyleTransferNet                   Partially delegated      40.9
timm_efficientformer_l1                   Partially delegated      17.6
timm_efficientformerv2_s0                 Partially delegated      16.1
timm_pvt_v2_b1                            Partially delegated      73.5
timm_pvt_v2_b3                            Partially delegated      246.7
timm_resnest14d                           Partially delegated      88.9
torchaudio_conformer                      Partially delegated      21.5
torchvision_convnext_tiny                 Partially delegated      8.2
torchvision_maxvit_t                      Partially delegated      194.0
torchvision_shufflenet_v2                 Partially delegated      9.5
torchvision_swin_tiny                     Partially delegated      164.4
torchvision_video_resnet2plus1d_18        Partially delegated      6832.0
torchvision_video_swin3d_tiny             Partially delegated      2617.8
yolox_tiny                                Partially delegated      11.2
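
To relate the millisecond figures in the table above to a real-time budget, the following sketch converts a per-inference time into an upper bound on back-to-back inferences per second and checks it against a target frame rate. MaxInferencesPerSecond and FitsFrameBudget are hypothetical helper names, and the calculation assumes the listed time is the whole per-frame cost, which ignores pre-processing, post-processing, and any data transfer.

```cpp
#include <cassert>

// Upper bound on sequential inferences per second for a given
// per-inference time in milliseconds.
double MaxInferencesPerSecond(double latency_ms) {
  assert(latency_ms > 0.0);
  return 1000.0 / latency_ms;
}

// True when one inference fits inside a single frame at the target frame
// rate, for example a budget of about 33.3 ms at 30 frames per second.
bool FitsFrameBudget(double latency_ms, double target_fps) {
  assert(latency_ms > 0.0);
  assert(target_fps > 0.0);
  return latency_ms <= 1000.0 / target_fps;
}
```

For instance, under these assumptions the 8.7 ms listed for hf_mobilevit_small fits a 30 frames per second budget, while the 43.1 ms listed for isnet does not.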